Parallelization of the QR Decomposition with Column Pivoting Using Column Cyclic Distribution on Multicore and GPU Processors
The QR decomposition with column pivoting (QRP) of a matrix is widely used for rank revealing. The performance of the LAPACK implementation (DGEQP3) of the Householder QRP algorithm is limited by the Level 2 BLAS operations required for updating the column norms. In this paper, we propose an implementation of the QRP algorithm that distributes the matrix columns in a round-robin fashion for better data locality and parallel memory bus utilization on multicore architectures. Our performance results show a 60% improvement over the routine DGEQP3 of Intel MKL (version 10.3) on a 12-core Intel Xeon X5670 machine. In addition, we show that the same data distribution is also suitable for general-purpose GPU processors, where our implementation obtains up to 90 GFlops on an NVIDIA GeForce GTX480. This is about 2 times faster than the QRP implementation of MAGMA (version 1.2.1).
Tomás and Bai were supported in part by the U.S. DOE SciDAC grant DOE-DE-FC0206ER25793 and NSF grant PHY1005502. This research used resources of the National Energy Research Scientific Computing Center, which is supported by the Office of Science of the U.S. DOE under Contract No. DE-AC02-05CH11231.
Tomás Domínguez, AE.; Bai, Z.; Hernández García, V. (2013). Parallelization of the QR Decomposition with Column Pivoting Using Column Cyclic Distribution on Multicore and GPU Processors. In High Performance Computing for Computational Science - VECPAR 2012. Springer Verlag (Germany): Series. 50-58. https://doi.org/10.1007/978-3-642-38718-0_8
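The two ideas named in this abstract, pivoting on running column norms and assigning columns to workers round-robin, can be illustrated with a small sequential NumPy sketch. This is not the authors' implementation: the function names `cyclic_pivot_search` and `qr_column_pivoting` and the `workers` parameter are hypothetical, the "workers" are simulated in a plain loop, and the norm downdate omits the cancellation safeguard that a production routine such as DGEQP3 applies.

```python
import numpy as np

def cyclic_pivot_search(norms2, k, workers):
    """Find the largest remaining squared column norm (columns >= k).

    Column j is assumed to be owned by worker j % workers (round-robin).
    Each simulated worker scans only its own columns; the local winners
    are then reduced to a global pivot index.
    """
    n = norms2.shape[0]
    best = k
    for w in range(workers):
        cols = np.arange(w, n, workers)      # columns owned by worker w
        cols = cols[cols >= k]               # only not-yet-factored columns
        if cols.size == 0:
            continue
        local = cols[np.argmax(norms2[cols])]
        if norms2[local] > norms2[best]:
            best = local
    return int(best)

def qr_column_pivoting(A, workers=4):
    """Householder QR with column pivoting; returns (R, perm) with A[:, perm] = Q R."""
    R = np.array(A, dtype=float)
    m, n = R.shape
    perm = np.arange(n)
    norms2 = np.sum(R * R, axis=0)           # squared column norms
    for k in range(min(m, n)):
        p = cyclic_pivot_search(norms2, k, workers)
        if p != k:                           # swap pivot column into position k
            R[:, [k, p]] = R[:, [p, k]]
            perm[[k, p]] = perm[[p, k]]
            norms2[[k, p]] = norms2[[p, k]]
        x = R[k:, k]
        alpha = np.linalg.norm(x)
        if alpha == 0.0:                     # remaining columns are zero
            break
        v = x.copy()
        v[0] += np.copysign(alpha, x[0])     # Householder vector
        v /= np.linalg.norm(v)
        R[k:, k:] -= 2.0 * np.outer(v, v @ R[k:, k:])
        R[k + 1:, k] = 0.0
        # Downdate the trailing column norms instead of recomputing them;
        # this is the step that is Level 2 BLAS bound in the blocked algorithm.
        norms2[k + 1:] = np.maximum(norms2[k + 1:] - R[k, k + 1:] ** 2, 0.0)
    return np.triu(R[:min(m, n), :]), perm
```

Because every worker holds an interleaved subset of the columns, the set of not-yet-factored columns stays roughly balanced across workers as `k` advances, which is the usual motivation for a cyclic rather than a blocked layout.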
Interpolatory methods for $\mathcal{H}_\infty$ model reduction of multi-input/multi-output systems
We develop here a computationally effective approach for producing
high-quality $\mathcal{H}_\infty$-approximations to large-scale linear
dynamical systems having multiple inputs and multiple outputs (MIMO). We extend
an approach for model reduction introduced by Flagg,
Beattie, and Gugercin for the single-input/single-output (SISO) setting, which
combined ideas originating in interpolatory $\mathcal{H}_2$-optimal model
reduction with complex Chebyshev approximation. Retaining this framework, our
approach to the MIMO problem has its principal computational cost dominated by
(sparse) linear solves, and so it can remain an effective strategy in many
large-scale settings. We are able to avoid computationally demanding
$\mathcal{H}_\infty$ norm calculations that are normally required to monitor
progress within each optimization cycle through the use of "data-driven"
rational approximations that are built upon previously computed function
samples. Numerical examples are included that illustrate our approach. We
produce high fidelity reduced models having consistently better
$\mathcal{H}_\infty$ performance than models produced via balanced truncation;
these models often are as good as (and occasionally better than) models
produced using optimal Hankel norm approximation as well. In all cases
considered, the method described here produces reduced models at far lower cost
than is possible with either balanced truncation or optimal Hankel norm
approximation.
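To make concrete why linear solves dominate the cost of interpolatory methods, here is a minimal sketch of one-sided tangential interpolation by projection. It is a generic textbook construction, not the method of this paper: it assumes a state-space model with identity descriptor matrix, real interpolation points, a Galerkin projection, and dense solves standing in for sparse ones. The names `transfer` and `tangential_interpolant` are hypothetical.

```python
import numpy as np

def transfer(A, B, C, s):
    """Evaluate the transfer function H(s) = C (sI - A)^{-1} B."""
    n = A.shape[0]
    return C @ np.linalg.solve(s * np.eye(n) - A, B)

def tangential_interpolant(A, B, C, shifts, directions):
    """Reduced model (Ar, Br, Cr) with Hr(s_i) b_i = H(s_i) b_i.

    One linear solve per interpolation point builds the projection basis;
    no norm of the full-order system is ever computed.
    """
    n = A.shape[0]
    cols = [np.linalg.solve(s * np.eye(n) - A, B @ b)
            for s, b in zip(shifts, directions)]
    V, _ = np.linalg.qr(np.column_stack(cols))   # orthonormal basis
    return V.T @ A @ V, V.T @ B, C @ V

# Usage on a small, stable, randomly generated test system (illustrative only).
rng = np.random.default_rng(0)
n = 30
M = rng.standard_normal((n, n))
A = -(M @ M.T + np.eye(n))                       # symmetric negative definite
B = rng.standard_normal((n, 2))
C = rng.standard_normal((2, n))
shifts = [0.5, 2.0]
directions = [rng.standard_normal(2) for _ in shifts]
Ar, Br, Cr = tangential_interpolant(A, B, C, shifts, directions)
```

With two interpolation points the reduced model has order 2, and its transfer function matches the full one along each chosen direction at the corresponding point.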
A GPU-based hyperbolic SVD algorithm
A one-sided Jacobi hyperbolic singular value decomposition (HSVD) algorithm,
using a massively parallel graphics processing unit (GPU), is developed. The
algorithm also serves as the final stage of solving a symmetric indefinite
eigenvalue problem. Numerical testing demonstrates the gains in speed and
accuracy over sequential and MPI-parallelized variants of similar Jacobi-type
HSVD algorithms. Finally, possibilities of hybrid CPU--GPU parallelism are
discussed. Comment: Accepted for publication in BIT Numerical Mathematics
Novel Modifications of Parallel Jacobi Algorithms
We describe two main classes of one-sided trigonometric and hyperbolic
Jacobi-type algorithms for computing eigenvalues and eigenvectors of Hermitian
matrices. These types of algorithms exhibit significant advantages over many
other eigenvalue algorithms. If the matrices permit, both types of algorithms
compute the eigenvalues and eigenvectors with high relative accuracy.
We present novel parallelization techniques for both trigonometric and
hyperbolic classes of algorithms, as well as some new ideas on how pivoting in
each cycle of the algorithm can improve the speed of the parallel one-sided
algorithms. These parallelization approaches are applicable to both
distributed-memory and shared-memory machines.
The numerical testing performed indicates that the hyperbolic algorithms may
be superior to the trigonometric ones, although, in theory, the latter seem
more natural. Comment: Accepted for publication in Numerical Algorithms
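For readers unfamiliar with the one-sided approach, the following is a minimal serial sketch of the trigonometric (Hestenes) variant only. It is not the parallel algorithm or the pivoting strategy of this paper, and the hyperbolic variant is not shown. The function name `one_sided_jacobi`, the row-cyclic sweep order, and the stopping tolerance are assumptions made for illustration. For a positive definite matrix H = GᵀG, the squared column norms of the converged factor are the eigenvalues of H.

```python
import numpy as np

def one_sided_jacobi(G, tol=1e-12, max_sweeps=30):
    """Orthogonalize the columns of G by plane rotations applied from the right.

    Returns (U, sigma, V) with U = G @ V, V orthogonal, U having mutually
    orthogonal columns, and sigma the column norms of U (singular values of G).
    """
    U = np.array(G, dtype=float)
    n = U.shape[1]
    V = np.eye(n)
    for _ in range(max_sweeps):
        rotated = False
        for p in range(n - 1):
            for q in range(p + 1, n):
                app = U[:, p] @ U[:, p]
                aqq = U[:, q] @ U[:, q]
                apq = U[:, p] @ U[:, q]
                if abs(apq) <= tol * np.sqrt(app * aqq):
                    continue                  # columns already orthogonal
                rotated = True
                # Rotation angle that zeroes the (p, q) inner product.
                zeta = (aqq - app) / (2.0 * apq)
                if zeta == 0.0:
                    t = 1.0
                else:
                    t = np.sign(zeta) / (abs(zeta) + np.sqrt(1.0 + zeta * zeta))
                c = 1.0 / np.sqrt(1.0 + t * t)
                s = c * t
                for M in (U, V):              # same rotation on both factors
                    col_p = M[:, p].copy()
                    M[:, p] = c * col_p - s * M[:, q]
                    M[:, q] = s * col_p + c * M[:, q]
        if not rotated:                       # a full sweep with no rotation
            break
    sigma = np.linalg.norm(U, axis=0)
    return U, sigma, V
```

Rotations acting on disjoint column pairs are independent of each other, which is what makes this family of algorithms attractive for parallel execution.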